Bamboo - Preliminary scaling results on multiple hybrid nodes of Knights Corner and Sandy Bridge processors

Authors

  • Tan Nguyen
  • Scott B. Baden
Abstract

We discuss our experience in using Bamboo to automatically optimize a stencil method on an Intel Xeon Phi-based cluster. We describe our solutions to three challenges: tolerating the high cost of inter-node communication, mapping program parallelism to multicore and many-core processors, and balancing workloads on-node across heterogeneous resources. We present results on TACC’s Stampede system. While the details of how to optimize stencil methods will differ from those of other application motifs, the three issues we have raised will apply to them as well.

Keywords: MIC, Intel Xeon Phi, data-driven, virtualization, latency hiding, load balancing, heterogeneity.

Similar resources

Task-Based Cholesky Decomposition on Knights Corner Using OpenMP

The growing popularity of the Intel Xeon Phi coprocessors and the continued development of this new many-core architecture have created the need for an open-source, scalable, and cross-platform task-based dense linear algebra package that can efficiently use this type of hardware. In this paper, we examined the design modifications necessary when porting PLASMA, a task-based dense linear algebra...

Implementation and Optimization of miniGMG — a Compact Geometric Multigrid Benchmark

Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this report, we describe miniGMG, our compact geometric multigrid benchmark designed to proxy the multigrid solves found in AMR applications. We explore optimization techniques for geometric multigrid on existing and emerging multicore syste...

Energy Efficiency Effects of Vectorization in Data Reuse Transformations for Many-Core Processors—A Case Study

Thread-level and data-level parallel architectures have become the design of choice in many of today’s energy-efficient computing systems. However, these architectures put substantially higher requirements on the memory subsystem than scalar architectures, making memory latency and bandwidth critical in their overall efficiency. Data reuse exploration aims at reducing the pressure on the memory...

OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms

We investigate the OpenMP parallelization and optimization of two novel data classification algorithms. The new algorithms are based on graph and PDE solution techniques and provide significant accuracy and performance advantages over traditional data classification algorithms in serial mode. The methods leverage the Nystrom extension to calculate eigenvalue/eigenvectors of the graph Laplacian ...

DD-αAMG on QPACE 3

We describe our experience porting the Regensburg implementation of the DD-αAMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present ...

Publication date: 2013